{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "635d8ebb",
      "metadata": {},
      "source": [
        "# Summary Evaluators\n",
        "\n",
        "- Author: [Youngjun Cho](https://github.com/choincnp)\n",
        "- Design: \n",
        "- Peer Review: \n",
        "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
        "\n",
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)\n",
        "\n",
        "## Overview\n",
        "\n",
        "This tutorial shows how to evaluate a RAG system with ```LangChain``` and ```LangSmith``` summary evaluators. It defines a function for RAG performance testing and a summary evaluator for relevance assessment. Using ```GPT-4o-mini``` as the grader, you can score how relevant the retrieved context is to each question and answer, and compare RAG chains backed by ```GPT-4o-mini``` and by a Llama 3.2 model served through ```Ollama```.\n",
        "\n",
        "### Table of Contents\n",
        "\n",
        "- [Overview](#overview)\n",
        "- [Environment Setup](#environment-setup)\n",
        "- [Defining a Function for RAG Performance Testing](#defining-a-function-for-rag-performance-testing)\n",
        "- [Summary Evaluator for Relevance Assessment](#summary-evaluator-for-relevance-assessment)\n",
        "\n",
        "### References\n",
        "\n",
        "- [How to define a summary evaluator](https://docs.smith.langchain.com/evaluation/how_to_guides/summary)\n",
        "----"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c6c7aba4",
      "metadata": {},
      "source": [
        "## Environment Setup\n",
        "\n",
        "Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.\n",
        "\n",
        "\n",
        "**[Note]**\n",
        "\n",
        "`langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for these tutorials.\n",
        "See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "21943adb",
      "metadata": {},
      "outputs": [],
      "source": [
        "%%capture --no-stderr\n",
        "%pip install langchain-opentutorial"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "f25ec196",
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install required packages\n",
        "from langchain_opentutorial import package\n",
        "\n",
        "package.install(\n",
        "    [\n",
        "        \"langsmith\",\n",
        "        \"langchain\",\n",
        "        \"langchain_core\",\n",
        "        \"langchain_openai\",\n",
        "        \"langchain_ollama\",\n",
        "    ],\n",
        "    verbose=False,\n",
        "    upgrade=False,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "690a9ae0",
      "metadata": {},
      "source": [
        "You can set API keys in a `.env` file or set them manually.\n",
        "\n",
        "[Note] If you’re not using the `.env` file, no worries! Just enter the keys directly in the cell below, and you’re good to go."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "327c2c7c",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Environment variables have been set successfully.\n"
          ]
        }
      ],
      "source": [
        "from dotenv import load_dotenv\n",
        "from langchain_opentutorial import set_env\n",
        "\n",
        "# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.\n",
        "if not load_dotenv():\n",
        "    set_env(\n",
        "        {\n",
        "            \"OPENAI_API_KEY\": \"\",\n",
        "            \"LANGCHAIN_API_KEY\": \"\",\n",
        "            \"LANGCHAIN_TRACING_V2\": \"true\",\n",
        "            \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com/\",\n",
        "            \"LANGCHAIN_PROJECT\": \"10-LangSmith-Summary-Evaluation\",  # set the project name same as the title\n",
        "        }\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "aa00c3f4",
      "metadata": {},
      "source": [
        "## Defining a Function for RAG Performance Testing\n",
        "\n",
        "We’ll create a RAG system for testing purposes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "69cb77da",
      "metadata": {},
      "outputs": [],
      "source": [
        "from myrag import PDFRAG\n",
        "\n",
        "# Function to generate answers for questions\n",
        "def ask_question_with_llm(llm):\n",
        "    # Create a PDFRAG object\n",
        "    rag = PDFRAG(\n",
        "        \"data/Newwhitepaper_Agents2.pdf\",\n",
        "        llm,\n",
        "    )\n",
        "\n",
        "    # Create a retriever\n",
        "    retriever = rag.create_retriever()\n",
        "\n",
        "    # Create a chain\n",
        "    rag_chain = rag.create_chain(retriever)\n",
        "\n",
        "    def _ask_question(inputs: dict):\n",
        "        # Retrieve context for the question\n",
        "        context = retriever.invoke(inputs[\"question\"])\n",
        "        # Combine retrieved documents into a single string\n",
        "        context = \"\\n\".join([doc.page_content for doc in context])\n",
        "        # Return a dictionary with the question, context, and answer\n",
        "        return {\n",
        "            \"question\": inputs[\"question\"],\n",
        "            \"context\": context,\n",
        "            \"answer\": rag_chain.invoke(inputs[\"question\"]),\n",
        "        }\n",
        "\n",
        "    return _ask_question"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "23688249",
      "metadata": {},
      "source": [
        "We’ll create two question-answering functions: one backed by ```GPT-4o-mini``` and one backed by a Llama 3.2 (1B) model served through ```Ollama```."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "ae212002",
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_openai import ChatOpenAI\n",
        "from langchain_ollama import ChatOllama\n",
        "\n",
        "gpt_chain = ask_question_with_llm(ChatOpenAI(model=\"gpt-4o-mini\", temperature=0))\n",
        "ollama_chain = ask_question_with_llm(ChatOllama(model=\"llama3.2:1b\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c34b7d12",
      "metadata": {},
      "source": [
        "The ```OpenAIRelevanceGrader``` evaluates how relevant the retrieved **context** is to either the **question** or the **answer**.\n",
        "- ```target=\"retrieval-question\"```: Evaluates the relevance of the retrieved **context** to the **question**.\n",
        "- ```target=\"retrieval-answer\"```: Evaluates the relevance of the retrieved **context** to the **answer**.\n",
        "\n",
        "We first need to define ```OpenAIRelevanceGrader```. The same cell also defines a ```GroundednessChecker``` class, which the rest of this tutorial does not use."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "0519a1bb",
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_core.prompts import ChatPromptTemplate, PromptTemplate\n",
        "from pydantic import BaseModel, Field\n",
        "\n",
        "\n",
        "# Data Models\n",
        "class GradeRetrievalQuestion(BaseModel):\n",
        "    \"\"\"A binary score to determine the relevance of the retrieved documents to the question.\"\"\"\n",
        "\n",
        "    score: str = Field(\n",
        "        description=\"Whether the retrieved context is relevant to the question, 'yes' or 'no'\"\n",
        "    )\n",
        "\n",
        "\n",
        "# Data Models\n",
        "class GradeRetrievalAnswer(BaseModel):\n",
        "    \"\"\"A binary score to determine the relevance of the retrieved documents to the answer.\"\"\"\n",
        "\n",
        "    score: str = Field(\n",
        "        description=\"Whether the retrieved context is relevant to the answer, 'yes' or 'no'\"\n",
        "    )\n",
        "\n",
        "\n",
        "class OpenAIRelevanceGrader:\n",
        "    \"\"\"\n",
        "    OpenAI-based relevance grader class.\n",
        "\n",
        "    This class evaluates how relevant a retrieved document is to a given question or answer.\n",
        "    It operates in two modes: 'retrieval-question' or 'retrieval-answer'.\n",
        "\n",
        "    Attributes:\n",
        "        llm: The language model instance to use\n",
        "        structured_llm_grader: LLM instance generating structured outputs\n",
        "        grader_prompt: Prompt template for evaluation\n",
        "\n",
        "    Args:\n",
        "        llm: The language model instance to use\n",
        "        target (str): Target of the evaluation ('retrieval-question' or 'retrieval-answer')\n",
        "    \"\"\"\n",
        "\n",
        "    def __init__(self, llm, target=\"retrieval-question\"):\n",
        "        \"\"\"\n",
        "        Initialization method for the OpenAIRelevanceGrader class.\n",
        "\n",
        "        Args:\n",
        "            llm: The language model instance to use\n",
        "            target (str): Target of the evaluation ('retrieval-question' or 'retrieval-answer')\n",
        "\n",
        "        Raises:\n",
        "            ValueError: Raised if an invalid target value is provided\n",
        "        \"\"\"\n",
        "        self.llm = llm\n",
        "\n",
        "        if target == \"retrieval-question\":\n",
        "            self.structured_llm_grader = llm.with_structured_output(\n",
        "                GradeRetrievalQuestion\n",
        "            )\n",
        "        elif target == \"retrieval-answer\":\n",
        "            self.structured_llm_grader = llm.with_structured_output(\n",
        "                GradeRetrievalAnswer\n",
        "            )\n",
        "        else:\n",
        "            raise ValueError(f\"Invalid target: {target}\")\n",
        "\n",
        "        # Prompt\n",
        "        target_variable = (\n",
        "            \"user question\" if target == \"retrieval-question\" else \"answer\"\n",
        "        )\n",
        "        system = f\"\"\"You are a grader assessing relevance of a retrieved document to a {target_variable}. \\n \n",
        "            It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \\n\n",
        "            If the document contains keyword(s) or semantic meaning related to the {target_variable}, grade it as relevant. \\n\n",
        "            Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to {target_variable}.\"\"\"\n",
        "\n",
        "        grade_prompt = ChatPromptTemplate.from_messages(\n",
        "            [\n",
        "                (\"system\", system),\n",
        "                (\n",
        "                    \"human\",\n",
        "                    f\"Retrieved document: \\n\\n {{context}} \\n\\n {target_variable}: {{input}}\",\n",
        "                ),\n",
        "            ]\n",
        "        )\n",
        "        self.grader_prompt = grade_prompt\n",
        "\n",
        "    def create(self):\n",
        "        \"\"\"\n",
        "        Creates and returns the relevance grader.\n",
        "\n",
        "        Returns:\n",
        "            Chain object for performing relevance evaluation\n",
        "        \"\"\"\n",
        "\n",
        "        retrieval_grader_oai = self.grader_prompt | self.structured_llm_grader\n",
        "        return retrieval_grader_oai\n",
        "\n",
        "\n",
        "class GroundnessQuestionScore(BaseModel):\n",
        "    \"\"\"Binary scores for relevance checks\"\"\"\n",
        "\n",
        "    score: str = Field(\n",
        "        description=\"relevant or not relevant. Answer 'yes' if the answer is relevant to the question else answer 'no'\"\n",
        "    )\n",
        "\n",
        "\n",
        "class GroundnessAnswerRetrievalScore(BaseModel):\n",
        "    \"\"\"Binary scores for relevance checks\"\"\"\n",
        "\n",
        "    score: str = Field(\n",
        "        description=\"relevant or not relevant. Answer 'yes' if the answer is relevant to the retrieved document else answer 'no'\"\n",
        "    )\n",
        "\n",
        "\n",
        "class GroundnessQuestionRetrievalScore(BaseModel):\n",
        "    \"\"\"Binary scores for relevance checks\"\"\"\n",
        "\n",
        "    score: str = Field(\n",
        "        description=\"relevant or not relevant. Answer 'yes' if the question is relevant to the retrieved document else answer 'no'\"\n",
        "    )\n",
        "\n",
        "\n",
        "class GroundednessChecker:\n",
        "    \"\"\"\n",
        "    GroundednessChecker class evaluates the accuracy of a document.\n",
        "\n",
        "    This class evaluates whether a given document is accurate.\n",
        "    It returns one of two values: 'yes' or 'no'.\n",
        "\n",
        "    Attributes:\n",
        "        llm (BaseLLM): The language model instance to use\n",
        "        target (str): Evaluation target ('retrieval-answer', 'question-answer', or 'question-retrieval')\n",
        "    \"\"\"\n",
        "\n",
        "    def __init__(self, llm, target=\"retrieval-answer\"):\n",
        "        \"\"\"\n",
        "        Constructor for the GroundednessChecker class.\n",
        "\n",
        "        Args:\n",
        "            llm (BaseLLM): The language model instance to use\n",
        "            target (str): Evaluation target ('retrieval-answer', 'question-answer', or 'question-retrieval')\n",
        "        \"\"\"\n",
        "        self.llm = llm\n",
        "        self.target = target\n",
        "\n",
        "    def create(self):\n",
        "        \"\"\"\n",
        "        Creates a chain for groundedness evaluation.\n",
        "\n",
        "        Returns:\n",
        "            Chain: Object for performing groundedness evaluation\n",
        "        \"\"\"\n",
        "        # Parser\n",
        "        if self.target == \"retrieval-answer\":\n",
        "            llm = self.llm.with_structured_output(GroundnessAnswerRetrievalScore)\n",
        "        elif self.target == \"question-answer\":\n",
        "            llm = self.llm.with_structured_output(GroundnessQuestionScore)\n",
        "        elif self.target == \"question-retrieval\":\n",
        "            llm = self.llm.with_structured_output(GroundnessQuestionRetrievalScore)\n",
        "        else:\n",
        "            raise ValueError(f\"Invalid target: {self.target}\")\n",
        "\n",
        "        # Prompt selection\n",
        "        if self.target == \"retrieval-answer\":\n",
        "            template = \"\"\"You are a grader assessing relevance of a retrieved document to a user question. \\n \n",
        "                Here is the retrieved document: \\n\\n {context} \\n\\n\n",
        "                Here is the answer: {answer} \\n\n",
        "                If the document contains keyword(s) or semantic meaning related to the user answer, grade it as relevant. \\n\n",
        "                \n",
        "                Give a binary score 'yes' or 'no' score to indicate whether the retrieved document is relevant to the answer.\"\"\"\n",
        "            input_vars = [\"context\", \"answer\"]\n",
        "\n",
        "        elif self.target == \"question-answer\":\n",
        "            template = \"\"\"You are a grader assessing whether an answer appropriately addresses the given question. \\n\n",
        "                Here is the question: \\n\\n {question} \\n\\n\n",
        "                Here is the answer: {answer} \\n\n",
        "                If the answer directly addresses the question and provides relevant information, grade it as relevant. \\n\n",
        "                Consider both semantic meaning and factual accuracy in your assessment. \\n\n",
        "                \n",
        "                Give a binary score 'yes' or 'no' score to indicate whether the answer is relevant to the question.\"\"\"\n",
        "            input_vars = [\"question\", \"answer\"]\n",
        "\n",
        "        elif self.target == \"question-retrieval\":\n",
        "            template = \"\"\"You are a grader assessing whether a retrieved document is relevant to the given question. \\n\n",
        "                Here is the question: \\n\\n {question} \\n\\n\n",
        "                Here is the retrieved document: \\n\\n {context} \\n\n",
        "                If the document contains information that could help answer the question, grade it as relevant. \\n\n",
        "                Consider both semantic meaning and potential usefulness for answering the question. \\n\n",
        "                \n",
        "                Give a binary score 'yes' or 'no' score to indicate whether the retrieved document is relevant to the question.\"\"\"\n",
        "            input_vars = [\"question\", \"context\"]\n",
        "\n",
        "        else:\n",
        "            raise ValueError(f\"Invalid target: {self.target}\")\n",
        "\n",
        "        # Create the prompt\n",
        "        prompt = PromptTemplate(\n",
        "            template=template,\n",
        "            input_variables=input_vars,\n",
        "        )\n",
        "\n",
        "        # Chain\n",
        "        chain = prompt | llm\n",
        "        return chain"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "12d6d71a",
      "metadata": {},
      "source": [
        "Next, create the ```retrieval-question``` grader and the ```retrieval-answer``` grader."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "ccd62891",
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_openai import ChatOpenAI\n",
        "\n",
        "\n",
        "rq_grader = OpenAIRelevanceGrader(\n",
        "    llm=ChatOpenAI(model=\"gpt-4o-mini\", temperature=0), target=\"retrieval-question\"\n",
        ").create()\n",
        "\n",
        "ra_grader = OpenAIRelevanceGrader(\n",
        "    llm=ChatOpenAI(model=\"gpt-4o-mini\", temperature=0), target=\"retrieval-answer\"\n",
        ").create()"
      ]
    },
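    {
      "cell_type": "markdown",
      "id": "07a2454d",
      "metadata": {},
      "source": [
        "Invoke the graders. A notebook cell displays only the value of its last expression, so the output below is the result of the ```ra_grader``` call."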
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "id": "1b792d72",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "GradeRetrievalAnswer(score='yes')"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "rq_grader.invoke(\n",
        "    {\n",
        "        \"input\": \"What are the three targeted learnings to enhance model performance?\",\n",
        "        \"context\": \"\"\"\n",
        "        The three targeted learning approaches to enhance model performance mentioned in the context are:\n",
        "            1. Pre-training-based learning\n",
        "            2. Fine-tuning based learning\n",
        "            3. Using External Memory\n",
        "        \"\"\",\n",
        "    }\n",
        ")\n",
        "\n",
        "ra_grader.invoke(\n",
        "    {\n",
        "        \"input\": \"\"\"\n",
        "        The three targeted learning approaches to enhance model performance mentioned in the context are:\n",
        "            1. In-context learning\n",
        "            2. Fine-tuning based learning\n",
        "            3. Using External Memory\n",
        "        \"\"\",\n",
        "        \"context\": \"\"\"\n",
        "        The three targeted learning approaches to enhance model performance mentioned in the context are:\n",
        "            1. Pre-training-based learning\n",
        "            2. Fine-tuning based learning\n",
        "            3. Using External Memory\n",
        "        \"\"\",\n",
        "    }\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "4ccb245d",
      "metadata": {},
      "source": [
        "## Summary Evaluator for Relevance Assessment\n",
        "\n",
        "Certain metrics can only be defined at the experiment level rather than for individual runs of an experiment.\n",
        "\n",
        "For example, you may need to **compute the evaluation score of a classifier across all runs** derived from a dataset.\n",
        "\n",
        "Evaluators that compute such experiment-level metrics are passed to ```evaluate``` through the ```summary_evaluators``` argument.\n",
        "\n",
        "Instead of a single ```Run``` and ```Example```, a summary evaluator receives the lists of all runs and examples in the experiment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "2fd85171",
      "metadata": {},
      "outputs": [],
      "source": [
        "from typing import List\n",
        "from langsmith.schemas import Example, Run\n",
        "from langsmith.evaluation import evaluate\n",
        "\n",
        "def relevance_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:\n",
        "    \"\"\"Average the question-relevance and answer-relevance rates over all runs.\"\"\"\n",
        "    rq_scores = 0  # Number of runs whose context is relevant to the question\n",
        "    ra_scores = 0  # Number of runs whose context is relevant to the answer\n",
        "\n",
        "    for run, example in zip(runs, examples):\n",
        "        question = example.inputs[\"question\"]\n",
        "        context = run.outputs[\"context\"]\n",
        "        prediction = run.outputs[\"answer\"]\n",
        "\n",
        "        # Evaluate question relevance\n",
        "        rq_score = rq_grader.invoke(\n",
        "            {\n",
        "                \"input\": question,\n",
        "                \"context\": context,\n",
        "            }\n",
        "        )\n",
        "        # Evaluate answer relevance\n",
        "        ra_score = ra_grader.invoke(\n",
        "            {\n",
        "                \"input\": prediction,\n",
        "                \"context\": context,\n",
        "            }\n",
        "        )\n",
        "\n",
        "        # Accumulate relevance scores\n",
        "        if rq_score.score == \"yes\":\n",
        "            rq_scores += 1\n",
        "        if ra_score.score == \"yes\":\n",
        "            ra_scores += 1\n",
        "\n",
        "    # Calculate the final relevance score (average of question and answer relevance rates)\n",
        "    # Guard against an empty run list to avoid division by zero\n",
        "    num_runs = len(runs)\n",
        "    if num_runs == 0:\n",
        "        return {\"key\": \"relevance_score\", \"score\": 0.0}\n",
        "    final_score = ((rq_scores / num_runs) + (ra_scores / num_runs)) / 2\n",
        "\n",
        "    return {\"key\": \"relevance_score\", \"score\": final_score}"
      ]
    },
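    {
      "cell_type": "markdown",
      "id": "a7d3f1c9",
      "metadata": {},
      "source": [
        "To see the aggregation in isolation, here is a minimal sketch that reproduces the arithmetic of the evaluator above without calling any LLM or LangSmith API. The grade lists are made-up values for illustration, and ```average_relevance``` is a hypothetical helper name, not part of LangSmith.\n",
        "\n",
        "```python\n",
        "from typing import List\n",
        "\n",
        "\n",
        "def average_relevance(rq_grades: List[str], ra_grades: List[str]) -> float:\n",
        "    \"\"\"Average the share of 'yes' grades for question and answer relevance.\"\"\"\n",
        "    if not rq_grades or len(rq_grades) != len(ra_grades):\n",
        "        raise ValueError(\"Both grade lists must be non-empty and equally long\")\n",
        "    n = len(rq_grades)\n",
        "    rq_rate = sum(g == \"yes\" for g in rq_grades) / n  # question relevance rate\n",
        "    ra_rate = sum(g == \"yes\" for g in ra_grades) / n  # answer relevance rate\n",
        "    return (rq_rate + ra_rate) / 2\n",
        "\n",
        "\n",
        "# Hypothetical grades for four runs (illustrative values only)\n",
        "rq_grades = [\"yes\", \"yes\", \"no\", \"yes\"]  # 3 of 4 contexts relevant to the question\n",
        "ra_grades = [\"yes\", \"no\", \"no\", \"yes\"]  # 2 of 4 contexts relevant to the answer\n",
        "\n",
        "print(average_relevance(rq_grades, ra_grades))  # (0.75 + 0.5) / 2 = 0.625\n",
        "```\n",
        "\n",
        "Because the final score is the mean of the two rates, a single number can hide a gap between question relevance and answer relevance. Returning the two rates under separate keys would be one way to keep them visible."
      ]
    },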
    {
      "cell_type": "markdown",
      "id": "8554cd1d",
      "metadata": {},
      "source": [
        "Now, let's run the evaluation for both chains."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "id": "09596bdc",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "View the evaluation results for experiment: 'SUMMARY_EVAL-6a120022' at:\n",
            "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=09484df0-8405-4452-9fda-0dc268a0de44\n",
            "\n",
            "\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "9348aa7b63f645e184ec0f744abe53f8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "0it [00:00, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "View the evaluation results for experiment: 'SUMMARY_EVAL-a04c0d31' at:\n",
            "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=0ed4c0f5-1fff-4ea4-ac81-7cafe03a66d7\n",
            "\n",
            "\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "48d2d879091f432585c4d09b1bc40cd7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "0it [00:00, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "dataset_name = \"RAG_EVAL_DATASET\"\n",
        "\n",
        "experiment_result1 = evaluate(\n",
        "    gpt_chain,\n",
        "    data=dataset_name,\n",
        "    summary_evaluators=[relevance_score_summary_evaluator],\n",
        "    experiment_prefix=\"SUMMARY_EVAL\",\n",
        "    metadata={\n",
        "        \"variant\": \"Using GPT-4o-mini: relevance evaluation with summary_evaluator\",\n",
        "    },\n",
        ")\n",
        "\n",
        "experiment_result2 = evaluate(\n",
        "    ollama_chain,\n",
        "    data=dataset_name,\n",
        "    summary_evaluators=[relevance_score_summary_evaluator],\n",
        "    experiment_prefix=\"SUMMARY_EVAL\",\n",
        "    metadata={\n",
        "        \"variant\": \"Using Ollama(llama3.2): relevance evaluation with summary_evaluator\",\n",
        "    },\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8b511348",
      "metadata": {},
      "source": [
        "Check the results in LangSmith.\n",
        "\n",
        "**[Note]**  \n",
        "Summary evaluator scores are not attached to individual examples in the dataset; they can be reviewed only at the experiment level."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5783394e",
      "metadata": {},
      "source": [
        "![LangSmith summary evaluation results](./assets/10-LangSmith-Summary-Evaluation-01.png)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "langchain-kr-lwwSZlnu-py3.11",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
